-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Don't retry function upload on 400 or 422 status #478
Conversation
✅ Deploy Preview for open-api ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The direction looks right 👍
Something I would suggest we had is a property (a boolean flag of some sort) that we can use to enable this new behaviour and that way feature flag it on Buildbot's side.
go/porcelain/deploy.go
Outdated
} | ||
|
||
// TODO this changes retry behaviour for the fileUpload case as well. OK? | ||
if apiErr.Code()/100 == 4 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would voice the same concern as @biruwon pointed out here. I'm concerned that retrying all the 4xx calls might be a stretch? (i.e. should we retry 429's too?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to just skip the retry for 400 and 422 status codes 🙂
I added this to the |
This is ready for review. We set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, let me know what you all think 👍
go/porcelain/deploy.go
Outdated
sharedErr.mutex.Lock() | ||
sharedErr.err = operationError | ||
sharedErr.mutex.Unlock() | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I had to spend some time looking through this but I believe there's some room for improvement here (in the retry strategy I mean 😅)
I was initially going to say that we didn't want to return nil
here but then I got that the idea is to actually exit the retry early. We have a check at the beggining of the retry method where we error out if sharedErr
is set:
open-api/go/porcelain/deploy.go
Lines 493 to 496 in 99d733e
if sharedErr.err != nil { sharedErr.mutex.Unlock() return fmt.Errorf("aborting upload of file %s due to failed upload of another file", f.Name) }
However this case is also unideal because it just means we'll keep on retrying and erroring out here until we exhaust the count.
I was reading through the docs for backoff
- https://pkg.go.dev/github.com/cenkalti/backoff/v4#Retry - and it seems like they have support for a PermanentError
- https://pkg.go.dev/github.com/cenkalti/backoff/v4#PermanentError - which is a way for us to exit out of the retry loop while keeping the "error semantics" correct. It seems like this would be the perfect fit for this case? What do you think?
@@ -287,7 +287,7 @@ func TestUploadFiles_Cancelation(t *testing.T) { | |||
for _, bundle := range files.Files { | |||
d.Required = append(d.Required, bundle.Sum) | |||
} | |||
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute) | |||
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we could add a test here which exercises the new code paths we introduced (i.e. checking we don't retry 400
and 422
errors when uploading).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JGAntunes do you have any pointers for the testing? I seem to be a bit stuck here. We are also having trouble running the test suite locally, which is slowing this down a bit 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about running the test suite locally, if it seems like nothing is running it's probably because of the retries loop.
try running the tests as:
go test -race github.com/netlify/open-api/v2/go/porcelain --run TestUploadFiles400Errors -v
And you should get some output (hopefully)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip @4xposed!! We got the tests running and can see the output which has been very helpful :D
go/porcelain/deploy.go
Outdated
sharedErr := &uploadError{err: nil, mutex: &sync.Mutex{}} | ||
permanentErr := &backoff.PermanentError{Err: nil} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unclear for me why we instantiate permanentErr
here, as could you elaborate?
as far as I know backoff.PermanentError
is to be used as return from a backoff.Retry()
block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially we weren't sure how to use PermanentError, but your other comment pointed us in the right direction I think
go/porcelain/deploy.go
Outdated
} | ||
|
||
if skipRetry && (apiErr.Code() == 400 || apiErr.Code() == 422) { | ||
operationError = permanentErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong, but I think we should wrap the existing error in a backoff.PermanentError{}
(for which backoff
provides the Permanent()
function:
operationError = permanentErr | |
operationError = backoff.Permanent(operationError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think that does make sense. And then when backoff
receives the Permanent
error, it should stop retrying, if I understand correctly.
ctx := gocontext.Background() | ||
|
||
server := httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, _ *http.Request) { | ||
rw.WriteHeader(http.StatusUnprocessableEntity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to properly assert the error message in the requre.Equal
in line 350
we need to make the mock server to return a body with a json that has code:
and message
keys:
rw.WriteHeader(http.StatusUnprocessableEntity) | |
rw.WriteHeader(http.StatusUnprocessableEntity) | |
rw.Header().Set("Content-Type", "application/json; charset=utf-8") | |
rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422}`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super helpful and definitely makes sense! -- looking at other test cases I see this is done similarly. However, when we set rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422}`))
we keep getting the error&{0 } (*models.Error) is not supported by the TextConsumer, can be resolved by supporting TextUnmarshaler interface
.
So then the apiError isn't set and then our test skips over the code we're trying to test and keeps retrying 🙈
Looking at other examples in other tests, it seems like it works to set the response this way when calling client.Operations.UploadDeployFile
, but not when calling client.uploadFiles
-- when calling uploadFiles
the error is thrown from UploadDeployFile
.
We've been trying to debug this by writing the body different ways, trying to marshal the response etc, but haven't had any success yet 🤔 Is there maybe something silly we're missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the error is being thrown from: https://github.com/netlify/open-api/blob/master/go/plumbing/operations/operations_client.go#L4131
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think It might be the order on which those are set, mockserver is a bit quirky at times. try if setting the headers in this order helps:
rw.Header().Set("Content-Type", "application/json; charset=utf-8")
rw.WriteHeader(http.StatusUnprocessableEntity)
rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422 }`))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 @4xposed thanks so much for the tip! That worked and now as far as I can tell, the server is responding as expected. However, now we're hitting a different problem and struggling to debug it.
We now receive the error with the correct code and message, but when we get to this line
apiErr, ok := operationError.(apierrors.Error)
no matter what we try, apiErr is always nil
and ok
is always false.
operationError
is an error
that looks like:
error(*github.com/netlify/open-api/v2/go/plumbing/operations.UploadDeployFunctionDefault) *{_statusCode: 422, Payload: *github.com/netlify/open-api/v2/go/models.Error {Code: 422, Message: "Unprocessable Entity"}}
In my debugger, I can access the status code by calling either operationError._statusCode
or operationError.Payload.Code
, but not by calling operationError.(apierrors.Error)
. However, I can't call operationError.Payload.Code
in the actual code, because Payload
/Code
isn't defined on the Error type.
So since operationError.(apierrors.Error)
never seems to evaluate to anything and there wasn't a test for the original code, we're uncertain if we're doing something wrong here, or if the original code was actually working as intended.
Are we missing something here? Is there a better way to test this (perhaps non-locally?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@4xposed Thanks, I really appreciate the offer! I'm in Europe now 🎉
Testing manually using the debugger, it looks like everything's working as expected, thanks so much for the fix. The one last thing I'm trying to do to wrap up this PR is to fix my tests:
require.Equal(t, err, "[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default &{Code:422 Message: Unprocessable Entity}")
doesn't work because the error's now wrapped by the backoff.Permanent
error. Is it sufficient to just requireError(t, err)
here? Or is there a better way to assert this that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pairing might not be necessary since most of this seems to be working, but I'd appreciate you taking another look when you have the time 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could call err.Unwrap()
seems like backoff.PermanentError
has a few methods for this: https://github.com/cenkalti/backoff/blob/v4/retry.go#L129
I don't think it's strictly necessary, but it would be nice to at least have some way that we have the right error and not any error, if that makes sense.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give it a shot! I do agree it would be nice to have the right error if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I used Require. ErrorContains
instead and that seemed to do the trick for verifying we have the correct error message
go/porcelain/deploy_test.go
Outdated
// Set SkipRetry to false | ||
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute, false) | ||
// require.Equal(t, err, "[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default &{Code:422 Message: Unprocessable Entity}") | ||
require.Equal(t, attempts, 12) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of attempts doesn't seem to be consistent, so maybe there's a better way to assert this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@4xposed this is one other tiny detail that sometimes makes the test fail - every once in a while it retries 13 times 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I think (emphasis on think) it retries infinitely until the context
times out or it's cancelled.
I think checking that there's more than 1 attempt (or some arbitrary number that's not too high) might be enough, maybe?
I will try to look more into the retry mechanism tomorrow morning to try to figure out if we can have more control in the test of how many times we retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check to just make sure that the attempts are greater than 1 :) That should probably be sufficient for what we're testing
# WIPDon't retry function upload on 4xx. A 4xx response
When a function upload receives a 400 or 422 status code, we know that retrying the upload will fail. This change returns early rather than retrying.